# Multimodal pre-training
- **MAKE** (xieji-x) · Text-to-Image · 108 downloads · 2 likes
  A zero-shot skin disease assessment model built on vision-language pre-training with multi-faceted knowledge enhancement, offering an effective tool for skin disease research and diagnosis.
- **Style 250412.vit Base Patch16 Siglip 384.v2 Webli** (p1atdev) · Image Classification · Transformers · 66 downloads · 0 likes
  A Vision Transformer-based vision model trained with SigLIP (Sigmoid Loss for Language-Image Pretraining), suitable for image understanding tasks.
- **Comp SigLIP So400M** (SliMM-X) · Apache-2.0 · Multimodal Fusion · 33 downloads · 1 like
  Part of the CoMP-MM family, this vision foundation model (VFM) is continually pre-trained from SigLIP and supports native-resolution image input.
- **Aimv2 Large Patch14 448.apple Pt** (timm) · Image Classification · Transformers · 68 downloads · 0 likes
  AIMv2 Large image encoder packaged for the timm library, using 14x14 patches at 448x448 input resolution and suited to high-resolution image feature extraction.
- **Aimv2 3b Patch14 448.apple Pt** (timm) · Image Classification · Transformers · 79 downloads · 0 likes
  AIMv2 image encoder at the 3B-parameter scale, packaged for the timm library and suitable for image feature extraction tasks.
- **Aimv2 3b Patch14 336.apple Pt** (timm) · Image Classification · Transformers · 35 downloads · 0 likes
  AIMv2 image encoder packaged for the timm library, suitable for image feature extraction tasks.
- **Aimv2 1b Patch14 336.apple Pt** (timm) · Image Classification · Transformers · 65 downloads · 0 likes
  AIMv2 image encoder developed by Apple in a timm-compatible packaging, suitable for image feature extraction tasks (see the loading sketch below).
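
The AIMv2 checkpoints above, like the timm-hosted CLIP and SigLIP encoders further down this list, are typically used as frozen feature extractors. A minimal sketch, assuming the timm model name `aimv2_large_patch14_448.apple_pt` corresponds to the first AIMv2 entry and that a local `example.jpg` exists:

```python
# Minimal feature-extraction sketch with a timm-packaged encoder.
# The checkpoint name is an assumption based on the entry above; any of the
# timm encoders in this list can be swapped in the same way.
import timm
import torch
from PIL import Image

model = timm.create_model("aimv2_large_patch14_448.apple_pt", pretrained=True, num_classes=0)
model.eval()

# Build the preprocessing pipeline the checkpoint was trained with.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # (1, feature_dim) pooled embedding
print(features.shape)
```

Passing `num_classes=0` drops the classification head, so the model returns pooled embeddings directly.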
- **Resnet101 Clip Gap.openai** (timm) · Apache-2.0 · Image Classification · Transformers · 104 downloads · 0 likes
  The ResNet101 image encoder from OpenAI CLIP, exposing image features through global average pooling (GAP).
- **Resnet50x4 Clip Gap.openai** (timm) · Apache-2.0 · Image Classification · Transformers · 170 downloads · 0 likes
  The ResNet50x4 image encoder from OpenAI CLIP, intended for image feature extraction.
- **Resnet50 Clip Gap.openai** (timm) · Apache-2.0 · Image Classification · Transformers · 250 downloads · 1 like
  The ResNet50 visual encoder from the CLIP model, exposing image features through global average pooling (GAP).
- **Vit Huge Patch14 Clip Quickgelu 378.dfn5b** (timm) · Other · Image Classification · Transformers · 27 downloads · 0 likes
  ViT-Huge CLIP image encoder trained on the DFN-5B dataset, using the QuickGELU activation.
- **Vit Huge Patch14 Clip 378.dfn5b** (timm) · Other · Image Classification · Transformers · 461 downloads · 0 likes
  The visual encoder of DFN5B-CLIP, based on the ViT-Huge architecture and trained at 378x378 resolution.
- **Vit Base Patch16 Clip 224.dfn2b** (timm) · Other · Image Classification · Transformers · 444 downloads · 0 likes
  Vision Transformer CLIP image encoder carrying the DFN2B-CLIP weights released by Apple.
- **Vit So400m Patch14 Siglip Gap 896.pali2 10b Pt** (timm) · Apache-2.0 · Text-to-Image · Transformers · 57 downloads · 1 like
  SigLIP image encoder with global average pooling, taken from the PaliGemma 2 10B model.
- **Vit Base Patch16 Siglip 256.webli** (timm) · Apache-2.0 · Image Classification · Transformers · 269 downloads · 1 like
  SigLIP ViT-B/16 image encoder with the original attention pooling head, suitable for image feature extraction tasks.
- **Vit Huge Patch14 Clip 224.laion2b** (timm) · Apache-2.0 · Image Classification · Transformers · 1,969 downloads · 0 likes
  ViT-Huge CLIP visual encoder trained on the LAION-2B dataset, supporting image feature extraction.
- **Vit Base Patch32 Clip 256.datacompxl** (timm) · Apache-2.0 · Image Classification · Transformers · 89 downloads · 0 likes
  Vision Transformer CLIP image encoder for feature extraction, accepting 256x256 resolution input.
- **Vit Base Patch32 Clip 224.laion2b** (timm) · Apache-2.0 · Image Classification · Transformers · 83 downloads · 0 likes
  Vision Transformer CLIP image encoder for feature extraction, trained on the LAION-2B dataset.
- **Vit Base Patch32 Clip 224.datacompxl** (timm) · Apache-2.0 · Image Classification · Transformers · 13 downloads · 0 likes
  Vision Transformer CLIP image encoder for feature extraction, trained on the DataComp-XL dataset.
- **Vit Base Patch16 Clip 224.datacompxl** (timm) · Apache-2.0 · Image Classification · Transformers · 36 downloads · 0 likes
  ViT-B/16 CLIP image encoder for feature extraction, trained on the DataComp-XL dataset.
- **Convnext Xxlarge.clip Laion2b Soup** (timm) · Apache-2.0 · Image Classification · Transformers · 220 downloads · 0 likes
  ConvNeXt-XXLarge CLIP image encoder trained by LAION, suitable for multimodal tasks.
- **Convnext Base.clip Laiona** (timm) · Apache-2.0 · Image Classification · Transformers · 14 downloads · 0 likes
  ConvNeXt-Base CLIP image encoder trained on the LAION-Aesthetic dataset, suitable for image feature extraction tasks.
- **Vit Huge Patch14 Clip 224.metaclip Altogether** (timm) · Image Classification · 171 downloads · 1 like
  CLIP model based on the ViT-Huge architecture, supporting zero-shot image classification tasks.
- **Vit Base Patch16 Clip 224.laion400m E31** (timm) · MIT · Image Classification · 1,469 downloads · 0 likes
  Vision Transformer CLIP model trained on the LAION-400M dataset, supporting zero-shot image classification tasks.
- **Vit Base Patch32 Clip 224.laion400m E32** (timm) · MIT · Image Classification · 5,957 downloads · 0 likes
  Vision Transformer CLIP model trained on the LAION-400M dataset, compatible with both the OpenCLIP and timm frameworks.
- **Resnet50 Clip.cc12m** (timm) · MIT · Image Classification · 233 downloads · 0 likes
  ResNet50 CLIP model trained on the CC12M dataset, supporting zero-shot image classification tasks.
- **Resnet50 Clip.yfcc15m** (timm) · MIT · Image Classification · 631 downloads · 0 likes
  ResNet50 CLIP model trained on the YFCC-15M dataset, compatible with both the open_clip and timm frameworks and supporting zero-shot image classification (see the sketch below).
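
For the entries noted as compatible with both open_clip and timm, zero-shot classification can be run through OpenCLIP. A hedged sketch, assuming the `RN50` architecture with the `yfcc15m` pretrained tag corresponds to the Resnet50 Clip.yfcc15m entry above; the prompts and image path are illustrative:

```python
# Zero-shot classification sketch with OpenCLIP; the ("RN50", "yfcc15m")
# pairing is an assumption mapped from the entry above.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("RN50", pretrained="yfcc15m")
tokenizer = open_clip.get_tokenizer("RN50")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]            # illustrative prompts
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```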
- **Siglip So400m Patch14 224** (google) · Apache-2.0 · Text-to-Image · Transformers · 6,654 downloads · 53 likes
  SigLIP is a CLIP-style multimodal model that replaces the softmax contrastive loss with a sigmoid loss; pre-trained on the WebLI dataset, it suits zero-shot image classification and image-text retrieval. A usage sketch follows below.
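
Because SigLIP scores each image-text pair independently with a sigmoid, zero-shot classification differs slightly from CLIP's softmax setup. A minimal sketch via Transformers, assuming the Hub id `google/siglip-so400m-patch14-224` matches the entry above and using illustrative prompts:

```python
# Zero-shot classification sketch with a SigLIP checkpoint via Transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

repo = "google/siglip-so400m-patch14-224"   # assumed Hub id for the entry above
model = AutoModel.from_pretrained(repo)
processor = AutoProcessor.from_pretrained(repo)

image = Image.open("example.jpg")                       # placeholder image path
texts = ["a photo of a cat", "a photo of a dog"]        # illustrative prompts
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The sigmoid gives each image-text pair an independent probability rather
# than a softmax distribution over all prompts.
probs = torch.sigmoid(outputs.logits_per_image)
print(dict(zip(texts, probs[0].tolist())))
```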
- **Vit Xsmall Patch16 Clip 224.tinyclip Yfcc15m** (timm) · MIT · Image Classification · 444 downloads · 0 likes
  A compact TinyCLIP vision-language model designed for efficient zero-shot image classification.
- **Vit B 16 Aion400m E32 1finetuned 1** (Albe-njupt) · MIT · Image Classification · 18 downloads · 1 like
  Vision Transformer model based on the OpenCLIP framework, fine-tuned for zero-shot image classification tasks.
- **Internvit 6B 448px V1 2** (OpenGVLab) · MIT · Text-to-Image · Transformers · 19 downloads · 27 likes
  InternViT-6B-448px-V1-2 is a vision foundation model (feature backbone) with roughly 5.5 billion parameters, processing images at 448x448 resolution. A loading sketch follows below.
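
A hedged loading sketch for the InternViT entry, assuming the Hub id `OpenGVLab/InternViT-6B-448px-V1-2` and the usual `trust_remote_code` pattern for OpenGVLab vision backbones; the image path is a placeholder:

```python
# Feature extraction sketch for InternViT via Transformers (remote code).
# The repo id and preprocessing choices here are assumptions; in practice a
# GPU is recommended, since the checkpoint has ~5.5B parameters.
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

repo = "OpenGVLab/InternViT-6B-448px-V1-2"
model = AutoModel.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).eval()
processor = CLIPImageProcessor.from_pretrained(repo)

image = Image.open("example.jpg").convert("RGB")        # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch.bfloat16)

with torch.no_grad():
    outputs = model(pixel_values)
print(outputs.last_hidden_state.shape)                  # patch-token features at 448x448
```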
- **Siglip Base Patch16 384** (google) · Apache-2.0 · Image-to-Text · Transformers · 2,570 downloads · 10 likes
  SigLIP multimodal model pre-trained on the WebLI dataset with a sigmoid loss, suitable for zero-shot image classification and image-text retrieval tasks.
- **Siglip Base Patch16 256** (google) · Apache-2.0 · Text-to-Image · Transformers · 12.71k downloads · 5 likes
  SigLIP vision-language model pre-trained on the WebLI dataset with a sigmoid loss, performing well on image classification and image-text retrieval tasks.
- **Protst Esm1b** (mila-intel) · Protein Model · Transformers · 173 downloads · 1 like
  ProtST enhances protein sequence pre-training and understanding with biomedical text: it builds the ProtDescribe dataset, defines three pre-training tasks, and supports both supervised learning and zero-shot prediction.
- **Altclip M18** (BAAI) · Text-to-Image · Transformers · 58 downloads · 5 likes
  AltCLIP-m18 is a CLIP model supporting 18 languages for image-text matching tasks.
- **CLIP Convnext Xxlarge Laion2b S34b B82k Augreg Rewind** (laion) · MIT · Text-to-Image · 63 downloads · 2 likes
  A CLIP ConvNeXt-XXLarge model trained on the LAION-2B dataset with the OpenCLIP framework, aimed at zero-shot image classification tasks.
- **Lilt Roberta En Base** (SCUT-DLVCLab) · MIT · Text Recognition · Transformers · 12.05k downloads · 19 likes
  Language-Independent Layout Transformer (LiLT) provides a LayoutLM-like model for any language by combining a pre-trained RoBERTa (English) text encoder with a pre-trained language-independent layout transformer; a usage sketch follows below.
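
A hedged sketch of feeding LiLT a few OCR words with their layout boxes via Transformers, assuming the Hub id `SCUT-DLVCLab/lilt-roberta-en-base`; the words and 0-1000 normalized bounding boxes are invented for illustration:

```python
# Running LiLT on word + bounding-box input via Transformers.
import torch
from transformers import AutoTokenizer, AutoModel

repo = "SCUT-DLVCLab/lilt-roberta-en-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

words = ["Invoice", "Total", "42.00"]                                     # illustrative OCR words
boxes = [[74, 68, 156, 92], [320, 600, 380, 624], [390, 600, 470, 624]]   # one 0-1000 box per word

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Expand word-level boxes to token level (special tokens get a zero box).
token_boxes = [[0, 0, 0, 0] if idx is None else boxes[idx] for idx in encoding.word_ids()]
encoding["bbox"] = torch.tensor([token_boxes])

with torch.no_grad():
    outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # layout-aware contextual token embeddings
```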